The following tutorial will let you reproduce the plots that we are going to create at the workshop using R.
Please read carefully and follow the steps. Wherever you see the Code icon on the right you can click on it to see the actual code used in that section.
This tutorial will focus on analysing the updated data of the worldwide Novel Corona virus (COVID-19) pandemic.
There are several data sources available online. We will use the data collected from a range of official sources and hosted on the Our World in Data website (Mathieu et al. 2021).
To run R and RStudio on Binder, click on this badge - .
Start RStudio and create a new project named Workshop3 in a new folder (if you need a reminder ho to do it, check out Workshop1 Tutorial on BB).
Once RStudio restarts inside the project’s folder, create a new R script named Workshop3.R and 2 new folders, one named data for our input data and another named output for our plots.
For this analysis we will again use some packages from the Tidyverse, but this time we load the specific packages (which are supposed to be pre-installed on your computers) to try and avoid having to download the entire tidyverse. In addition to the Tidyverse packages we’ve got to know in the previous workshop we will use the plotly package to create interactive plots, paletteer for custom color palettes, readxl to read MS-Excel file, scales to format large numbers, lubridate to better handle dates, glue to paste together strings, patchwork to include several plots in a single figure and a few others to assist in getting the data into shape.
To install these packages, we will introduce a package called pacman that will assist in loading the required packages and installing them if they’re not already installed. To install it we use the install.packages('pacman') command, please note that the package name need to be quoted and that we only need to perform it once, or when we want or need to update the package. Once the package was installed, we can load its functions using the library(pacman) command and then load/install all the other packages at once with p_load() function.
# install required packages - needed only once! (comment with a # after first use)
install.packages('pacman')
# load required packages
library(pacman)
p_load(tidyverse, paletteer, glue, scales, plotly, lubridate, patchwork, visdat)More information on installing and using R packages can be found in this tutorial.
Now that we’ve got RStudio up and running and our packages installed and loaded, we can read data into R from our local computer or from web locations using dedicated functions specific to the file type (.csv, .txt, .xlsx, etc.).
We will use the read_csv() command/function from the readr package (part of the tidyverse) to load the data directly from a file on the Our World in Data website into a variable of type data frame (table). If we don’t want to use external packages, we can use the read.csv() function from base R, which won’t automatically parse columns containing dates and in previous versions of R (< 4.0) will slightly change the structure of the resulting data frame (all text columns will be converted into factors).
> Note that in this case, we need to specify the column types because the data contains a lot of missing values that interfere with the automatic parsing of the column types._
Let’s use built-in functions for a brief data exploration, such as head() to show the first 10 rows of the data and str() for the type of data in each column (see detailed information on each variable in the data repository on GitHub):
## # A tibble: 6 x 67
## iso_code continent location date total_cases new_cases new_cases_smoot~
## <chr> <fct> <chr> <date> <dbl> <dbl> <dbl>
## 1 AFG Asia Afghanis~ 2020-02-24 5 5 NA
## 2 AFG Asia Afghanis~ 2020-02-25 5 0 NA
## 3 AFG Asia Afghanis~ 2020-02-26 5 0 NA
## 4 AFG Asia Afghanis~ 2020-02-27 5 0 NA
## 5 AFG Asia Afghanis~ 2020-02-28 5 0 NA
## 6 AFG Asia Afghanis~ 2020-02-29 5 0 NA
## # ... with 60 more variables: total_deaths <dbl>, new_deaths <dbl>,
## # new_deaths_smoothed <dbl>, total_cases_per_million <dbl>,
## # new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
## # total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
## # new_deaths_smoothed_per_million <dbl>, reproduction_rate <dbl>,
## # icu_patients <dbl>, icu_patients_per_million <dbl>, hosp_patients <dbl>,
## # hosp_patients_per_million <dbl>, weekly_icu_admissions <dbl>, ...
## spec_tbl_df [175,986 x 67] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ iso_code : chr [1:175986] "AFG" "AFG" "AFG" "AFG" ...
## $ continent : Factor w/ 6 levels "Asia","Europe",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ location : chr [1:175986] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ date : Date[1:175986], format: "2020-02-24" "2020-02-25" ...
## $ total_cases : num [1:175986] 5 5 5 5 5 5 5 5 5 5 ...
## $ new_cases : num [1:175986] 5 0 0 0 0 0 0 0 0 0 ...
## $ new_cases_smoothed : num [1:175986] NA NA NA NA NA NA 0.714 0 0 0 ...
## $ total_deaths : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ new_deaths : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ new_deaths_smoothed : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ total_cases_per_million : num [1:175986] 0.126 0.126 0.126 0.126 0.126 0.126 0.126 0.126 0.126 0.126 ...
## $ new_cases_per_million : num [1:175986] 0.126 0 0 0 0 0 0 0 0 0 ...
## $ new_cases_smoothed_per_million : num [1:175986] NA NA NA NA NA NA 0.018 0 0 0 ...
## $ total_deaths_per_million : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ new_deaths_per_million : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ new_deaths_smoothed_per_million : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ reproduction_rate : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ icu_patients : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ icu_patients_per_million : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ hosp_patients : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ hosp_patients_per_million : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_icu_admissions : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_icu_admissions_per_million : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_hosp_admissions : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_hosp_admissions_per_million : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ total_tests : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ total_tests_per_thousand : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_per_thousand : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_smoothed : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_smoothed_per_thousand : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ positive_rate : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ tests_per_case : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ tests_units : chr [1:175986] NA NA NA NA ...
## $ total_vaccinations : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ people_vaccinated : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ people_fully_vaccinated : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ total_boosters : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations_smoothed : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ total_vaccinations_per_hundred : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ people_vaccinated_per_hundred : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ people_fully_vaccinated_per_hundred : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ total_boosters_per_hundred : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations_smoothed_per_million : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ new_people_vaccinated_smoothed : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ new_people_vaccinated_smoothed_per_hundred: num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ stringency_index : num [1:175986] 8.33 8.33 8.33 8.33 8.33 ...
## $ population : num [1:175986] 39835428 39835428 39835428 39835428 39835428 ...
## $ population_density : num [1:175986] 54.4 54.4 54.4 54.4 54.4 ...
## $ median_age : num [1:175986] 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 ...
## $ aged_65_older : num [1:175986] 2.58 2.58 2.58 2.58 2.58 ...
## $ aged_70_older : num [1:175986] 1.34 1.34 1.34 1.34 1.34 ...
## $ gdp_per_capita : num [1:175986] 1804 1804 1804 1804 1804 ...
## $ extreme_poverty : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ cardiovasc_death_rate : num [1:175986] 597 597 597 597 597 ...
## $ diabetes_prevalence : num [1:175986] 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 ...
## $ female_smokers : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ male_smokers : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ handwashing_facilities : num [1:175986] 37.7 37.7 37.7 37.7 37.7 ...
## $ hospital_beds_per_thousand : num [1:175986] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ life_expectancy : num [1:175986] 64.8 64.8 64.8 64.8 64.8 ...
## $ human_development_index : num [1:175986] 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 ...
## $ excess_mortality_cumulative_absolute : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ excess_mortality_cumulative : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ excess_mortality : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## $ excess_mortality_cumulative_per_million : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. iso_code = col_character(),
## .. continent = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. location = col_character(),
## .. date = col_date(format = ""),
## .. total_cases = col_double(),
## .. new_cases = col_double(),
## .. new_cases_smoothed = col_double(),
## .. total_deaths = col_double(),
## .. new_deaths = col_double(),
## .. new_deaths_smoothed = col_double(),
## .. total_cases_per_million = col_double(),
## .. new_cases_per_million = col_double(),
## .. new_cases_smoothed_per_million = col_double(),
## .. total_deaths_per_million = col_double(),
## .. new_deaths_per_million = col_double(),
## .. new_deaths_smoothed_per_million = col_double(),
## .. reproduction_rate = col_double(),
## .. icu_patients = col_double(),
## .. icu_patients_per_million = col_double(),
## .. hosp_patients = col_double(),
## .. hosp_patients_per_million = col_double(),
## .. weekly_icu_admissions = col_double(),
## .. weekly_icu_admissions_per_million = col_double(),
## .. weekly_hosp_admissions = col_double(),
## .. weekly_hosp_admissions_per_million = col_double(),
## .. total_tests = col_double(),
## .. new_tests = col_double(),
## .. total_tests_per_thousand = col_double(),
## .. new_tests_per_thousand = col_double(),
## .. new_tests_smoothed = col_double(),
## .. new_tests_smoothed_per_thousand = col_double(),
## .. positive_rate = col_double(),
## .. tests_per_case = col_double(),
## .. tests_units = col_character(),
## .. total_vaccinations = col_double(),
## .. people_vaccinated = col_double(),
## .. people_fully_vaccinated = col_double(),
## .. total_boosters = col_double(),
## .. new_vaccinations = col_double(),
## .. new_vaccinations_smoothed = col_double(),
## .. total_vaccinations_per_hundred = col_double(),
## .. people_vaccinated_per_hundred = col_double(),
## .. people_fully_vaccinated_per_hundred = col_double(),
## .. total_boosters_per_hundred = col_double(),
## .. new_vaccinations_smoothed_per_million = col_double(),
## .. new_people_vaccinated_smoothed = col_double(),
## .. new_people_vaccinated_smoothed_per_hundred = col_double(),
## .. stringency_index = col_double(),
## .. population = col_double(),
## .. population_density = col_double(),
## .. median_age = col_double(),
## .. aged_65_older = col_double(),
## .. aged_70_older = col_double(),
## .. gdp_per_capita = col_double(),
## .. extreme_poverty = col_double(),
## .. cardiovasc_death_rate = col_double(),
## .. diabetes_prevalence = col_double(),
## .. female_smokers = col_double(),
## .. male_smokers = col_double(),
## .. handwashing_facilities = col_double(),
## .. hospital_beds_per_thousand = col_double(),
## .. life_expectancy = col_double(),
## .. human_development_index = col_double(),
## .. excess_mortality_cumulative_absolute = col_double(),
## .. excess_mortality_cumulative = col_double(),
## .. excess_mortality = col_double(),
## .. excess_mortality_cumulative_per_million = col_double()
## .. )
We can also produce some descriptive statistics to better understand the data and the nature of each variable. The summary() function (as can be guessed by its name) provides a quick summary of basic descriptive statistics, such as the mean, min, max and quantiles for continuous numerical values.
## iso_code continent location
## Length:175986 Asia :37546 Length:175986
## Class :character Europe :38504 Class :character
## Mode :character Africa :40957 Mode :character
## North America:27408
## South America: 9881
## Oceania :11370
## NA's :10320
## date total_cases new_cases new_cases_smoothed
## Min. :2020-01-01 Min. : 1 Min. : 0 Min. : 0
## 1st Qu.:2020-09-15 1st Qu.: 2215 1st Qu.: 1 1st Qu.: 7
## Median :2021-03-28 Median : 29306 Median : 79 Median : 108
## Mean :2021-03-22 Mean : 2769817 Mean : 12308 Mean : 12325
## 3rd Qu.:2021-09-28 3rd Qu.: 323160 3rd Qu.: 1077 3rd Qu.: 1177
## Max. :2022-04-02 Max. :490664062 Max. :4089078 Max. :3436882
## NA's :6309 NA's :6508 NA's :8502
## total_deaths new_deaths new_deaths_smoothed
## Min. : 1 Min. : 0.0 Min. : 0.000
## 1st Qu.: 84 1st Qu.: 0.0 1st Qu.: 0.143
## Median : 822 Median : 2.0 Median : 2.429
## Mean : 60003 Mean : 168.3 Mean : 170.048
## 3rd Qu.: 7670 3rd Qu.: 19.0 3rd Qu.: 21.143
## Max. :6151255 Max. :18156.0 Max. :14795.857
## NA's :24379 NA's :24367 NA's :26527
## total_cases_per_million new_cases_per_million new_cases_smoothed_per_million
## Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.: 659.3 1st Qu.: 0.03 1st Qu.: 1.628
## Median : 5152.5 Median : 11.37 Median : 19.317
## Mean : 33126.1 Mean : 179.55 Mean : 178.980
## 3rd Qu.: 41583.2 3rd Qu.: 103.60 3rd Qu.: 125.521
## Max. :706541.9 Max. :51427.49 Max. :16052.608
## NA's :7095 NA's :7294 NA's :9282
## total_deaths_per_million new_deaths_per_million
## Min. : 0.00 Min. : 0.000
## 1st Qu.: 19.99 1st Qu.: 0.000
## Median : 141.91 Median : 0.118
## Mean : 534.91 Mean : 1.669
## 3rd Qu.: 764.43 3rd Qu.: 1.358
## Max. :6363.99 Max. :453.772
## NA's :25152 NA's :25140
## new_deaths_smoothed_per_million reproduction_rate icu_patients
## Min. : 0.000 Min. :-0.04 Min. : 0.0
## 1st Qu.: 0.016 1st Qu.: 0.79 1st Qu.: 31.0
## Median : 0.290 Median : 0.99 Median : 155.0
## Mean : 1.670 Mean : 0.99 Mean : 910.5
## 3rd Qu.: 1.763 3rd Qu.: 1.17 3rd Qu.: 600.0
## Max. :144.167 Max. : 6.13 Max. :28891.0
## NA's :27294 NA's :44834 NA's :152433
## icu_patients_per_million hosp_patients hosp_patients_per_million
## Min. : 0.00 Min. : 0 Min. : 0.00
## 1st Qu.: 4.50 1st Qu.: 140 1st Qu.: 28.81
## Median : 13.71 Median : 774 Median : 91.60
## Mean : 24.05 Mean : 4288 Mean : 170.74
## 3rd Qu.: 34.66 3rd Qu.: 2969 3rd Qu.: 232.24
## Max. :177.28 Max. :154540 Max. :1544.08
## NA's :152433 NA's :151021 NA's :151021
## weekly_icu_admissions weekly_icu_admissions_per_million weekly_hosp_admissions
## Min. : 0.00 Min. : 0.00 Min. : 0
## 1st Qu.: 50.25 1st Qu.: 4.09 1st Qu.: 345
## Median : 218.00 Median : 10.95 Median : 1393
## Mean : 466.72 Mean : 15.26 Mean : 5911
## 3rd Qu.: 659.00 3rd Qu.: 20.12 3rd Qu.: 5345
## Max. :4838.00 Max. :221.21 Max. :154002
## NA's :170268 NA's :170268 NA's :164589
## weekly_hosp_admissions_per_million total_tests new_tests
## Min. : 0.00 Min. : 0 Min. : 1
## 1st Qu.: 24.92 1st Qu.: 323269 1st Qu.: 2150
## Median : 75.36 Median : 1821136 Median : 8673
## Mean :105.08 Mean : 17905115 Mean : 66788
## 3rd Qu.:144.12 3rd Qu.: 8888466 3rd Qu.: 36513
## Max. :645.81 Max. :847109806 Max. :3740296
## NA's :164589 NA's :102386 NA's :104844
## total_tests_per_thousand new_tests_per_thousand new_tests_smoothed
## Min. : 0.00 Min. : 0.00 Min. : 0
## 1st Qu.: 38.59 1st Qu.: 0.27 1st Qu.: 1786
## Median : 202.58 Median : 0.94 Median : 7621
## Mean : 793.00 Mean : 3.25 Mean : 58840
## 3rd Qu.: 768.13 3rd Qu.: 2.91 3rd Qu.: 34008
## Max. :31688.97 Max. :534.01 Max. :3080396
## NA's :102386 NA's :104844 NA's :84662
## new_tests_smoothed_per_thousand positive_rate tests_per_case
## Min. : 0.00 Min. :0.00 Min. : 1.0
## 1st Qu.: 0.23 1st Qu.:0.02 1st Qu.: 7.1
## Median : 0.90 Median :0.06 Median : 17.0
## Mean : 2.89 Mean :0.10 Mean : 147.6
## 3rd Qu.: 2.65 3rd Qu.:0.14 3rd Qu.: 51.0
## Max. :147.60 Max. :0.99 Max. :189932.0
## NA's :84662 NA's :91217 NA's :92085
## tests_units total_vaccinations people_vaccinated
## Length:175986 Min. :0.000e+00 Min. :0.000e+00
## Class :character 1st Qu.:6.521e+05 1st Qu.:4.046e+05
## Mode :character Median :5.194e+06 Median :3.195e+06
## Mean :1.875e+08 Mean :9.453e+07
## 3rd Qu.:3.308e+07 3rd Qu.:1.890e+07
## Max. :1.129e+10 Max. :5.079e+09
## NA's :128425 NA's :130772
## people_fully_vaccinated total_boosters new_vaccinations
## Min. :1.000e+00 Min. :1.000e+00 Min. : 0
## 1st Qu.:2.942e+05 1st Qu.:3.269e+03 1st Qu.: 6005
## Median :2.553e+06 Median :5.558e+05 Median : 40663
## Mean :7.584e+07 Mean :2.408e+07 Mean : 1158698
## 3rd Qu.:1.495e+07 3rd Qu.:5.046e+06 3rd Qu.: 276462
## Max. :4.565e+09 Max. :1.652e+09 Max. :54498879
## NA's :133413 NA's :155893 NA's :136672
## new_vaccinations_smoothed total_vaccinations_per_hundred
## Min. : 0 Min. : 0.00
## 1st Qu.: 992 1st Qu.: 13.54
## Median : 8918 Median : 63.00
## Mean : 508364 Mean : 77.09
## 3rd Qu.: 64254 3rd Qu.:129.33
## Max. :43567880 Max. :348.27
## NA's :85734 NA's :128425
## people_vaccinated_per_hundred people_fully_vaccinated_per_hundred
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.50 1st Qu.: 5.73
## Median : 38.82 Median : 29.47
## Mean : 39.32 Mean : 34.07
## 3rd Qu.: 66.15 3rd Qu.: 60.04
## Max. :124.77 Max. :122.54
## NA's :130772 NA's :133413
## total_boosters_per_hundred new_vaccinations_smoothed_per_million
## Min. : 0.00 Min. : 0
## 1st Qu.: 0.02 1st Qu.: 636
## Median : 4.24 Median : 2071
## Mean : 14.24 Mean : 3217
## 3rd Qu.: 25.25 3rd Qu.: 4598
## Max. :100.96 Max. :117497
## NA's :155893 NA's :85734
## new_people_vaccinated_smoothed new_people_vaccinated_smoothed_per_hundred
## Min. : 0 Min. : 0.00
## 1st Qu.: 386 1st Qu.: 0.02
## Median : 3663 Median : 0.07
## Mean : 202883 Mean : 0.14
## 3rd Qu.: 25216 3rd Qu.: 0.18
## Max. :21377378 Max. :11.75
## NA's :86606 NA's :86606
## stringency_index population population_density median_age
## Min. : 0.00 Min. :4.700e+01 Min. : 0.137 Min. :15.10
## 1st Qu.: 39.81 1st Qu.:9.029e+05 1st Qu.: 37.312 1st Qu.:22.30
## Median : 54.17 Median :7.553e+06 Median : 87.324 Median :30.60
## Mean : 54.06 Mean :1.444e+08 Mean : 460.409 Mean :30.66
## 3rd Qu.: 70.37 3rd Qu.:3.278e+07 3rd Qu.: 214.243 3rd Qu.:39.10
## Max. :100.00 Max. :7.875e+09 Max. :20546.766 Max. :48.20
## NA's :38448 NA's :1103 NA's :19223 NA's :30586
## aged_65_older aged_70_older gdp_per_capita extreme_poverty
## Min. : 1.14 Min. : 0.526 Min. : 661.2 Min. : 0.10
## 1st Qu.: 3.53 1st Qu.: 2.063 1st Qu.: 4449.9 1st Qu.: 0.60
## Median : 6.70 Median : 4.209 Median : 13111.2 Median : 2.20
## Mean : 8.85 Mean : 5.580 Mean : 19687.7 Mean :13.59
## 3rd Qu.:14.31 3rd Qu.: 9.167 3rd Qu.: 27936.9 3rd Qu.:21.20
## Max. :27.05 Max. :18.493 Max. :116935.6 Max. :77.60
## NA's :32136 NA's :31353 NA's :31109 NA's :81276
## cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers
## Min. : 79.37 Min. : 0.990 Min. : 0.10 Min. : 7.70
## 1st Qu.:168.71 1st Qu.: 5.350 1st Qu.: 1.90 1st Qu.:21.60
## Median :243.81 Median : 7.200 Median : 6.30 Median :31.40
## Mean :259.95 Mean : 8.355 Mean :10.64 Mean :32.78
## 3rd Qu.:329.94 3rd Qu.:10.590 3rd Qu.:19.30 3rd Qu.:41.30
## Max. :724.42 Max. :30.530 Max. :44.00 Max. :78.10
## NA's :30883 NA's :23980 NA's :65791 NA's :67302
## handwashing_facilities hospital_beds_per_thousand life_expectancy
## Min. : 1.19 Min. : 0.10 Min. :53.28
## 1st Qu.: 20.86 1st Qu.: 1.30 1st Qu.:69.59
## Median : 49.84 Median : 2.40 Median :75.09
## Mean : 50.85 Mean : 3.03 Mean :73.67
## 3rd Qu.: 82.50 3rd Qu.: 4.00 3rd Qu.:79.19
## Max. :100.00 Max. :13.80 Max. :86.75
## NA's :104581 NA's :47324 NA's :11466
## human_development_index excess_mortality_cumulative_absolute
## Min. :0.39 Min. : -37726
## 1st Qu.:0.60 1st Qu.: -45
## Median :0.74 Median : 3526
## Mean :0.73 Mean : 38593
## 3rd Qu.:0.84 3rd Qu.: 26111
## Max. :0.96 Max. :1111865
## NA's :34265 NA's :169963
## excess_mortality_cumulative excess_mortality
## Min. :-28.45 Min. :-95.92
## 1st Qu.: -0.46 1st Qu.: -0.58
## Median : 6.29 Median : 7.42
## Mean : 9.56 Mean : 15.84
## 3rd Qu.: 14.61 3rd Qu.: 22.48
## Max. :111.01 Max. :375.00
## NA's :169963 NA's :169963
## excess_mortality_cumulative_per_million
## Min. :-1826.60
## 1st Qu.: -19.13
## Median : 508.82
## Mean : 1023.02
## 3rd Qu.: 1690.75
## Max. : 9339.47
## NA's :169963
## # A tibble: 175,986 x 67
## iso_code continent location date total_cases new_cases new_cases_smoot~
## <chr> <fct> <chr> <date> <dbl> <dbl> <dbl>
## 1 OWID_WRL <NA> World 2022-04-02 490664062 1055332 1406584.
## 2 OWID_WRL <NA> World 2022-04-01 489608730 1170974 1449590.
## 3 OWID_WRL <NA> World 2022-03-31 488437812 1874508 1545464.
## 4 OWID_WRL <NA> World 2022-03-30 486563304 1589756 1527973.
## 5 OWID_WRL <NA> World 2022-03-29 484973548 1750295 1554157.
## 6 OWID_WRL <NA> World 2022-03-28 483223253 1493115 1586314.
## 7 OWID_WRL <NA> World 2022-03-27 481730138 912111 1583087.
## 8 OWID_WRL <NA> World 2022-03-26 480818027 1356374 1605688.
## 9 OWID_WRL <NA> World 2022-03-25 479461653 1842090 1656139.
## 10 OWID_WRL <NA> World 2022-03-24 477619563 1752069 1666355.
## # ... with 175,976 more rows, and 60 more variables: total_deaths <dbl>,
## # new_deaths <dbl>, new_deaths_smoothed <dbl>, total_cases_per_million <dbl>,
## # new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
## # total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
## # new_deaths_smoothed_per_million <dbl>, reproduction_rate <dbl>,
## # icu_patients <dbl>, icu_patients_per_million <dbl>, hosp_patients <dbl>,
## # hosp_patients_per_million <dbl>, weekly_icu_admissions <dbl>, ...
What are the metadata columns that describe our observations?
continent
location
date
Why do we have observations with the continent as NA?
# check which location have continent as NA
covid_data %>% filter(is.na(continent)) %>% count(location)## # A tibble: 13 x 2
## location n
## <chr> <int>
## 1 Africa 780
## 2 Asia 802
## 3 Europe 801
## 4 European Union 801
## 5 High income 802
## 6 International 786
## 7 Low income 770
## 8 Lower middle income 802
## 9 North America 802
## 10 Oceania 799
## 11 South America 771
## 12 Upper middle income 802
## 13 World 802
Some rows contain summarised data of entire continents/World, we'll need to remove those
We can see that most of our data contains ‘0’ (check the difference between the median and the mean in total_cases and total_deaths columns). Just to confirm that, let’s plot a histogram of all the confirmed cases
ggplot(covid_data, aes(x=total_cases)) +
geom_histogram(fill="lightskyblue") +
theme_bw(def_text_size)The data is evolving over days (a time-series), to there’s no point treating it as a random population sample.
Let’s look at confirmed cases and total deaths data for the 10 most affected countries (to date). To find out these countries so we need to wrangle our data a little bit using the following steps:
filter(!is.na(continent)group_by()arrange(desc(date))slice(1) and remove grouping with ungroup()inner_join()Optional step:
location variable as a factor and order it so the countries will be ordered in the legend by the number of cases withThen we can look at the data as a table and make a plot with the number of cases in the y-axis and date in the x-axis.
# find the 10 most affected countries (to date)
latest_data <- covid_data %>% filter(!is.na(continent)) %>%
group_by(location) %>% arrange(desc(date)) %>% slice(1) %>% ungroup()
most_affected_countries <- latest_data %>%
arrange(desc(total_cases)) %>% slice(1:10) %>%
select(location)
# subset just the data from the 10 most affected countries and order them from the most affected to the least one
most_affected_data <- covid_data %>%
inner_join(most_affected_countries) %>%
mutate(Country=factor(location, levels = most_affected_countries$location))
# create a line plot the data of total deaths
ggplot(most_affected_data, aes(x=date, y=total_cases, colour=Country)) +
geom_line(size=0.75) + scale_y_continuous(labels=comma) +
scale_color_paletteer_d("rcartocolor::Bold") +
labs(color="Country", y = "Total COVID-19 cases") +
theme_bw(def_text_size)It’s a bit hard to figure out how the pandemic evolved because the numbers in US, Brazil and India are an order of magnitude larger than the rest (which are very close to each other). How can we make it more visible (and also improve how of the dates appear in the x-axis)?
# better formatting of date axis, log scale
plot <- ggplot(most_affected_data, aes(x=date, y=total_cases, colour=Country)) +
geom_line(size=0.75) + scale_y_log10(labels=comma) +
scale_x_date(NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
scale_color_paletteer_d("rcartocolor::Bold") +
labs(color="Country", y = "Total COVID-19 cases") +
theme_bw(def_text_size)
# show an interactive plot
ggplotly(plot)Why did we get a warning message and why the graphs don’t start at the bottom of the x-axis? How can we solve it? What can we infer from the graph (exponential increase)?
What happens when we take the log of 0?? Can we remove those 0s with the `filter()` function (or add a very small number to them)?
We can see a very similar trend for most countries and while the curve has flattened substantially in April last year, the numbers are still rising. It is also evident that Europe got hit by a second wave arount October last year and India in April this year.
Let’s have a look at the total deaths in these countries (and get rid of the minor grid lines to make Frank happy)
# create a line plot the data of total deaths
ggplot(most_affected_data, aes(x=date, y=total_deaths, colour=Country)) +
geom_line(size=0.75) + scale_y_continuous(labels=comma) +
scale_x_date(NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
scale_color_paletteer_d("rcartocolor::Bold") +
labs(color="Country", y = "Total deaths") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank()) # remove minor grid linesLet’s have a look at the number of vaccinated people.
# vaccination rates
ggplot(most_affected_data, aes(x=date, y=people_vaccinated, colour=Country)) +
geom_line(size=0.75) + scale_y_continuous(labels=comma) +
scale_color_paletteer_d("rcartocolor::Bold") +
scale_x_date(NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
labs(color="Country", y="People Vaccinated") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank())The graphs are “broken”, meaning that it is not continuous and we have some missing data.
Let’s visualise some of the variables in our data and assess “missingness”.
# visualise missingness
vis_dat(covid_data %>% filter(date>dmy("01-01-2021")) %>%
select(continent, location, total_cases, total_deaths,
hosp_patients, people_vaccinated, people_fully_vaccinated))# find which countries has the most number of observations (least missing data)
covid_data %>% filter(!is.na(continent), !is.na(people_vaccinated)) %>% # group_by(location) %>%
count(location) %>% arrange(desc(n)) %>% print(n=30)## # A tibble: 219 x 2
## location n
## <chr> <int>
## 1 Norway 484
## 2 United States 473
## 3 Latvia 472
## 4 Denmark 471
## 5 Israel 468
## 6 Liechtenstein 465
## 7 Switzerland 465
## 8 Canada 462
## 9 Chile 461
## 10 Czechia 460
## 11 Estonia 460
## 12 Germany 460
## 13 Italy 460
## 14 Lithuania 460
## 15 Slovenia 460
## 16 France 459
## 17 Argentina 458
## 18 Belgium 457
## 19 Singapore 454
## 20 Ireland 452
## 21 United Kingdom 444
## 22 Greece 440
## 23 India 430
## 24 Brazil 428
## 25 Malta 427
## 26 Indonesia 425
## 27 Ecuador 423
## 28 Bolivia 419
## 29 Peru 415
## 30 Bahrain 412
## # ... with 189 more rows
covid_data %>% filter(!is.na(continent), !is.na(hosp_patients)) %>% # group_by(location) %>%
count(location) %>% arrange(desc(n)) %>% print(n=30)## # A tibble: 38 x 2
## location n
## <chr> <int>
## 1 Italy 769
## 2 Estonia 765
## 3 Netherlands 764
## 4 Israel 762
## 5 Canada 753
## 6 Czechia 753
## 7 Hungary 753
## 8 Slovakia 752
## 9 Cyprus 749
## 10 Ireland 748
## 11 Slovenia 748
## 12 Belgium 747
## 13 France 746
## 14 Sweden 744
## 15 Luxembourg 742
## 16 Portugal 741
## 17 Malaysia 740
## 18 United Kingdom 735
## 19 Australia 733
## 20 Austria 732
## 21 Serbia 732
## 22 Switzerland 732
## 23 Denmark 730
## 24 Latvia 727
## 25 Bulgaria 721
## 26 Croatia 712
## 27 Poland 702
## 28 Malta 690
## 29 Norway 642
## 30 United States 626
## # ... with 8 more rows
Now we can focus on a subset of countries that have more complete vaccination and hospitalisation rates data, so we could compare them.
countries_subset <- c("Italy", "United States", "Israel", "United Kingdom", "France", "Czechia")
# subset our original data to these countries
hosp_data <- covid_data %>% filter(location %in% countries_subset)
# define a new colour palette for these countries
col_pal <- "ggsci::category10_d3"Let’s look at hospitalisation rates first.
ggplot(hosp_data, aes(x=date, y = hosp_patients,colour=location)) +
geom_line(size=0.75) + scale_y_continuous(labels=comma) +
scale_color_paletteer_d(col_pal) +
scale_x_date(name = NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
labs(color="Country", y = "Hospitalised patients") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank())Can you identify the “waves” in each country?
It’s hard to see the details in the countries with lower number of hospitalised patients, how can we improve the visualisation?
Look at hospitalision rates proportional to the population size!
# hosp per population size
p1 <- ggplot(hosp_data,
aes(x=date, y = hosp_patients_per_million,colour=location)) +
geom_line(size=0.75) +
scale_y_continuous(labels=comma) +
scale_color_paletteer_d(col_pal) +
scale_x_date(name = NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
labs(color="Country", y = "Hospitalised patients (per million)") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank())
p1Now let’s try to compare this to vaccination rates
# total vaccination per population
p2 <- ggplot(hosp_data,
aes(x=date, y = people_fully_vaccinated_per_hundred/100 ,colour=location)) +
geom_line(size=0.75, linetype="dashed") +
guides(color = guide_legend(override.aes = list(linetype="solid") ) ) +
scale_y_continuous(labels=percent) +
scale_color_paletteer_d(col_pal) +
scale_x_date(name = NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
labs(color="Country", y = "Fully vaccinated (percent)") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank())
p2What will be the best way to compare these values? Maybe like this:
(p1 + guides(color=FALSE)) / (p2 + theme(legend.position = "bottom")) #+ plot_layout(guides = 'collect')# show graphs on top of each otherAny other suggestions? Let’s try to present them on the same graph (note the trick with the secondary y-axis).
# show on the same graph
p4 <- ggplot(hosp_data,
aes(x=date, colour=location)) +
geom_line(aes(y = hosp_patients_per_million), size=0.75) +
geom_line(aes(y = people_fully_vaccinated_per_hundred*10), size=0.75, linetype="dashed") +
scale_y_continuous(labels=comma, name = "Hospitalised patients per million (solid)",
# Add a second axis and specify its features
sec.axis = sec_axis(trans=~./10, name="Fully vaccinated per hundred (dashed)")) +
scale_color_paletteer_d(col_pal) +
scale_x_date(NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
labs(color="Country") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank())
p4Probably best to present them on the same graph, but for each country separately (done with the facet_wrap() function).
# show on the same graph, but separate each country
p4 +
scale_x_date(NULL,
breaks = breaks_width("4 months"),
labels = label_date_short()) +
facet_wrap(~location)Save the plot to a file.
# create output folder
dir.create("./output", showWarnings = FALSE)
# save the plot to pdf file
ggsave("output/hospit_vacc_rates_facet_country.pdf", width=14, height = 8)# subset our original data to our countries + Europe (European average)
mort_data <- covid_data %>%
filter(location %in% c(countries_subset, "Europe")) %>%
select(date, location, new_deaths_smoothed_per_million)
# Get the EU new death rates
EU_mortality_data <- covid_data %>% filter(location=="Europe") %>%
select(date, EU_dpm=new_deaths_smoothed_per_million)
# create a new calculated column with the difference from the EU average
comp_mort_data <- mort_data %>% left_join(EU_mortality_data) %>%
mutate(new_mort_change=new_deaths_smoothed_per_million-EU_dpm,
location=fct_relevel(factor(location), "Europe", after = Inf))
mort_plot <- ggplot(comp_mort_data,
aes(x=date, y = new_mort_change ,colour=location)) +
geom_line(size=0.75) +
guides(color = guide_legend(override.aes = list(linetype="solid") ) ) +
# scale_y_continuous(labels=percent) +
scale_color_paletteer_d(col_pal) +
scale_x_date(name = NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
labs(color="Country",
y = "New COVID-related deaths (in comparison to the EU average)") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank())
ggplotly(mort_plot)Save the plot to a file.
# save the plot to pdf file
ggsave("output/mortality_rates_vs_EU_average.pdf", width=12, height = 8)Plot for each country separately:
ggplot(comp_mort_data %>% filter(location!="Europe"),
aes(x=date, y = new_mort_change ,colour=location)) +
geom_line(size=0.75) +
guides(color = guide_legend(override.aes = list(linetype="solid") ) ) +
# scale_y_continuous(labels=percent) +
scale_color_paletteer_d(col_pal) +
scale_x_date(name = NULL,
breaks = breaks_width("4 months"),
labels = label_date_short()) +
labs(color="Country",
y = "New COVID-related deaths (in comparison to the EU average)") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank()) +
facet_wrap(~location, ncol = 2) +
geom_hline(yintercept = 0)# save the plot to pdf file
ggsave("output/mortality_rates_vs_EU_average_facets.pdf", width=11, height = 8)1. Mortalities (Case Fatality Rate)?
2. Suggestions? (cases per population density, vaccination rates by country income, deaths by number of beds per capita, etc.
3. Level of reporting in each country...
Best way to conclude the presentation about our new Bachelor program in Environmental Data Science in @CZUvPraze. Small edit in the programming language used in the original strip of @xkcdComic. Sorry Pearl users :) #DataScience #rstats @FesCuls pic.twitter.com/KBkjLgFNwR
— Yannis Markonis (@YannisMarkonis) June 25, 2020
Please contact me at i.bar@griffith.edu.au for any questions or comments.
Mathieu E, Ritchie H, Ortiz-Ospina E, et al. (2021) A global database of COVID-19 vaccinations. Nat Hum Behav 5:947–953. doi: 10.1038/s41562-021-01122-8